Guidelines

Other guidelines in this publication lay out many aspects of platform technology that could be considered, but in brief, here are some indicators that a platform may have features that facilitate preservation:

- The platform uses appropriate standards relevant to the publishing community, e.g. standardized metadata, exports to common formats, accessibility standards.
- The platform uses established technologies rather than depending on newer, more experimental technologies that may not be well supported.
- The platform itself is well established and broadly adopted.
- There are existing workflows for preservation.
- The platform has a comprehensive export option that includes all raw materials, dependencies (e.g. fonts), descriptive metadata, and packaging metadata that describe how it all fits together.
- The export package supports, through completeness and use of standards, a complete migration to a new platform with equivalent features, rather than being closely tied to the current platform.
- In the absence of an export, the platform includes a predictable structure or API that could facilitate content discovery, enumeration, or harvesting from an external source.
- Finally, the platform does not have an over-abundance of built-in features that will not be used, as these can add bulk and complexity to preservation workflows.

If you are developing a new publishing platform, or have control over how publishing platform features are designed or implemented, use existing standards to guide decisions. For example, there are standards for bibliographic data (e.g. ONIX, Dublin Core), full-text data (e.g. TEI, EPUB), annotations (e.g. W3C’s Web Annotation Data Model), persistent identifiers (e.g. DOIs, Handles, ARK IDs), citations (e.g. MLA, BibTeX), metrics (e.g. COUNTER), accessibility (e.g. W3C’s Web Content Accessibility Guidelines) and more. Preservation workflows scale best when working with common standards.
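As an illustration, a minimal bibliographic record expressed with Dublin Core elements might look like this (the title, names, and identifier are invented for the example):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An Example Enhanced Monograph</dc:title>
  <dc:creator>Author, Example</dc:creator>
  <dc:publisher>Example University Press</dc:publisher>
  <dc:date>2023</dc:date>
  <dc:identifier>https://doi.org/10.xxxx/example</dc:identifier>
  <dc:rights>https://creativecommons.org/licenses/by/4.0/</dc:rights>
</metadata>
```

Because the elements come from a broadly adopted standard, a preservation service can parse and validate records like this at scale rather than handling each publisher's metadata ad hoc.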

When using out-of-the-box software solutions for your publishing platform, export and preservation workflows are often designed around the built-in functionality of that software. For this reason, it is helpful to use platform features as intended. If the built-in functionality of the publishing software does not meet local requirements, avoid making undocumented, one-off changes to core code in order to get something working quickly. Instead, attempt to formalize and document any changes to the out-of-the-box software so that the new functionality is reusable in other publications and internally consistent within the platform. If the platform software has a formal process for applying enhancements (e.g. a plugin process), make use of this. Ensure any export processes are modified to align with the local changes and if working with a preservation partner, communicate any local changes to the software. The risk of not following a formal process may be loss of the new features during preservation, updates, or platform migration. An undocumented customization can disrupt the preservation of entire publications.

These other guidelines may be helpful when implementing new features:
3. Use existing standards when implementing features
5. Establish formatting rules for common features
6. Keep preservation partners informed of changes that affect the publications

Consistency is the key to a scalable preservation workflow, and so if a publisher or platform supports multimedia content or other enhanced features, establish basic rules early on and continue to express these features in a consistent way. Limit formats and arrangements as much as possible. For example, if one embedded video is an MP4 with no caption, another is a WebM and has a caption in a box, and another still is a Vimeo video with a caption but no box, for some approaches these minor inconsistencies can cause problems when performing preservation activities at scale. These potential variations should be clearly defined and constrained.
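For example, a publisher might standardize on a single embed pattern for captioned video, such as this sketch (the class name and file path are illustrative):

```html
<figure class="embedded-video">
  <!-- One agreed format (MP4) and one caption arrangement for every video -->
  <video src="/media/video-01.mp4" controls></video>
  <figcaption>Caption text for the video.</figcaption>
</figure>
```

When every video in every publication follows the same pattern, an export or harvesting process can handle all of them with one rule.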

These other guidelines may be helpful when implementing new features:
3. Use existing standards when implementing features
6. Keep preservation partners informed of changes that affect the publications
70. Consider systematically tagging material that should be excluded or tagging material that should be included as part of the preserved content

For platforms and publishers working with a preservation service, preservation workflows will be designed based on the sample publications provided. If these are not representative of the full range of functionality that the publishing platform supports, then the preservation workflow developed may miss things that the publisher wants to preserve. Keep a record of the scope of variations that might be found in a publication. As formatting rules for a publication change or expand, or as new file formats or arrangements become likely, inform your preservation service so that they can adapt their workflows accordingly and avoid missing important features.

For more about changes that should be communicated to a preservation service:
4. Document any changes to the default functionality of a platform
5. Establish and document basic formatting rules
10. Define and document the core intellectual components of a work
70. Consider systematically tagging material that should be excluded from preservation
71. Document and share the platform-level approach to preserving components of a publication

If a publication platform enables user-contributed content that is managed by the platform, e.g. annotations or comments, the platform’s Terms of Use should clearly define the rights related to that content, especially if the publisher may wish to preserve or migrate it as part of the context of the publication. If a publication is likely to be archived with this context intact, the implementation of these features and their associated terms should factor in ethical considerations of how a user’s information is displayed on the platform, and how users are informed about and consent to the use of their content.

See also:
55. Ethical concerns of user-contributed content
70. Consider systematically tagging material that should be excluded from preservation

PubPub supports features that allow users to contribute content through annotations and comments. This content is integrated into the page and can’t be excluded from web crawls. The default PubPub Terms of Service template includes language that covers User-Generated Content under a Creative Commons Attribution 4.0 License:

By submitting User-Generated Content, you hereby make that User-Generated Content available under the Creative Commons Attribution 4.0 License, and you represent and warrant that you have the right to provide your User Generated Content under that license, that all of that User Generated Content is either authored by you, or provided by third parties under the Creative Commons Attribution 4.0 License or in the public domain, and that your User Generated Content contains no personally identifiable information of third parties who have not expressly authorized you to provide it as part of your User Generated Content. All of your User-Generated Content must be appropriately marked with licensing and attribution information.

These terms allow for preservation of User-Generated Content on PubPub.

If a publication platform integrates third-party applications for features such as annotations or comments, the publisher should ensure that the terms of service for that application provide appropriate permission for preserving and migrating that content over time.

See also:
14. Avoid being dependent on third party services for core features
15. Plan a strategy for preservation when third party dependencies exist

Some third-party annotation services have restrictive default terms of service or do not define their terms of service. Hypothesis, an annotation tool that can be added to or used with most websites, grants a CC0 license for all annotation data stored on their servers. This means you don’t need to seek special permission to preserve the annotation data.

A preservation service will work with a publisher to determine the version(s) of record. If there may be multiple versions of record, or if draft versions are considered significant, the parameters of these should be clearly defined. In addition, these versions should be identified in a formal way so that automated updates can occur as needed while retaining clarity across the preservation copies.

These guidelines relate to other aspects of versioning:
23. Express versioning in bibliographic metadata
31. Assign new identifiers to significant versions

When referencing an external resource in a publication, see if there is a version of the resource that has a unique persistent identifier and, if so, use that identifier to reference it. While all “persistent” identifiers can eventually break if they are not properly maintained, they are more likely to last than other links and they uniquely identify a resource. Another option for tackling “link rot”—the term for when links stop working—is to use a web archiving snapshot service such as archive.today or the Internet Archive’s Save Page Now service to archive the page and reference the resulting snapshot as an alternative link in the document. Robust Links are one way to present this to users.
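For example, the Robust Links convention decorates an ordinary anchor with attributes pointing at an archived snapshot of the resource (the URLs here are illustrative):

```html
<a href="https://example.org/cited-resource"
   data-versionurl="https://web.archive.org/web/20230101000000/https://example.org/cited-resource"
   data-versiondate="2023-01-01">the cited resource</a>
```

If the original URL breaks, a reader (or a script) can fall back to the snapshot recorded in data-versionurl.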

These guidelines cover other instances that may benefit from use of identifiers:
27. Assign persistent identifiers to publication resources and use them
31. Assign identifiers to significant new versions of the work

If a publication is document-like, exporting and transforming the core intellectual components to an existing standard for full text publications e.g. to EPUB, TEI, or JATS/BITS XML is a robust approach. This includes publications that contain multimedia or remote content since these enhanced features can be managed more easily at scale when the rest of the publication is expressed in a standard form. Existing standards can be validated at scale, support both platform migration and preservation, and may steer enhanced features to be expressed more consistently to work with the document.

These guidelines may also be helpful when considering the export package for a linear publication:
3. Use existing standards for export formats
10. Identify and document the core intellectual components of a work
20. Ensure exports cover all core intellectual components

An excessive number of small metadata files or a complex folder hierarchy within an export package adds complexity to the workflow. Ideally, export processes consolidate metadata into one file per publication, and the folder and file structure are mostly flattened, predictable, and use a consistent naming convention. Metadata should be fully expressed within the metadata file, not via filenames and folder names, and should include references to the files being described so that they are easily connected. The complexity of a submitted information package has an impact on the ability of a preservation service to efficiently and quickly convert it to an archival information package. Reducing the number of separate metadata files and folders reduces processing time and can improve stability in the long term by simplifying migration either to a preservation system or to another platform. To the extent that the goal is an automated preservation workflow, the export packages should be consistent across publications.
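As a sketch, a preservation-friendly export package might be laid out like this (all names are illustrative):

```
book-slug/
  metadata.xml        <- one consolidated metadata file referencing every file below
  text/
    chapter-01.xhtml
    chapter-02.xhtml
  media/
    fig-01.jpg
    video-01.mp4
```

A single metadata file, a shallow hierarchy, and predictable file names let a preservation service map the package to an archival information package with minimal custom processing.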

See also:
22. Use an appropriate metadata serialization within the export package

In addition to the main text and embedded or supplemental media, other features or content such as annotations, high-quality versions of media, supporting data, a visual walk through with the author, and peer reviews may be considered integral to the work in some cases. If so, these resources should be part of the export package so that they can be preserved alongside the publication. Special provisions may need to be made for artifacts that are hosted outside of the platform to include them in the export.

See also:
10. Identify and document the core intellectual components of a work
72. Create a video walkthrough of any complex features

Current publishing platforms can support frequent updates and new versions. These should be expressed clearly through the metadata so that the preserved copies can be properly distinguished from each other. If something has changed, it should be reflected in the version and date and where necessary, new exports should be provided.

These guidelines also relate to versioning:
9. Determine the version of record in your context
31. Assign new identifiers to significant versions of a work

Fulcrum has structured its export packages, which include EPUBs, to support preservation. The enhanced media viewers used in the online version of Fulcrum EPUBs will not work if the Fulcrum platform is no longer available, so the export process simplifies these features to help ensure the EPUBs retain essential functionality over the long term. For photos, it embeds a static view of the photo inside the EPUB instead of depending on an IIIF viewer. For audio and video, where the enhanced media players were once embedded in the EPUB, it instead displays a persistent DOI link that points to the current location of the media resource. The export package also includes all media files, as well as a CSV registry that indicates which DOI points to which file, so that the linked file can be identified even if the DOI does not resolve. These features are all applied in a way that conforms to the EPUB 3 standard.

Many publication resources that are supported by modern publishing platforms warrant their own description to ensure they are properly credited, interpreted, and rendered with context in the future. Where possible, include descriptive metadata for each resource. Use an existing standard for guidance on what to include, e.g. Dublin Core. A publisher may be able to leverage data from an art log or author questionnaire to produce this metadata.

These guidelines add additional context to creating metadata for publication resources:
16. Captions for non-text features add meaningful context
22. Express metadata in an appropriate structured format
25. Express the license information in the resource-level metadata
26. Describe connections between resources in the metadata
27. Assign and use unique persistent identifiers for publication resources

Correct handling of character encoding can make an enormous difference to whether a publication is properly rendered. The encoding should be expressed in the metadata and/or within the publication as appropriate for the format. For example, websites may declare the encoding in a <meta> tag and/or in the charset parameter of the Content-Type HTTP header.
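For example, a UTF-8 encoded HTML publication can declare its encoding in both places:

```html
<!-- In the HTTP response header:
     Content-Type: text/html; charset=utf-8 -->

<!-- Inside the document's <head>: -->
<meta charset="utf-8">
```

Declaring the encoding in the document itself means the information survives even when the file is separated from its original HTTP context, as often happens in an archive.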

If a publication contains digital enhancements that are important enough to warrant preservation, the publication inclusive of its enhancements may be substantial enough to warrant a new ISBN, DOI, or other persistent identifier. This practice would ensure that the new version can be easily distinguished from other unenhanced versions of the publication in the preservation system.

These guidelines also relate to management of versions and use of identifiers:
9. Define the “version of record” in your context
17. Use persistent identifiers to link or cite external resources
23. Include version information in bibliographic metadata

An XML sitemap containing links to all of the content in a website ensures that website archiving crawlers will be able to locate all of that content. It may also improve search engine optimization. Sitemaps that are intended to facilitate web archiving should include links for all texts, resource landing pages, downloads, and views of the data, i.e. API URLs that are called dynamically while the user is interacting with the page, covering each combination of query parameters that may appear.
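A minimal sitemap following the sitemaps.org protocol might look like this (the URLs are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/book-slug/text</loc></url>
  <url><loc>https://example.org/book-slug/resources?page=1</loc></url>
  <url><loc>https://example.org/book-slug/resources?page=2</loc></url>
  <!-- API URL called dynamically while a user interacts with the page -->
  <url><loc>https://example.org/api/resources?book=book-slug&amp;page=1</loc></url>
</urlset>
```

Note that ampersands in query strings must be escaped as &amp; for the sitemap to be valid XML.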

This guideline will make creating a sitemap simpler:
46. For websites, give each page state its own URL

Successful website archiving is contingent on a harvester visiting each URL that forms the work. If a full list of URLs is not supplied to the harvesting tool via a sitemap or through other configuration, automation may be used to discover the URLs. Automated website crawling tools can easily identify the target of simple HTML <a> or <link> tags with a relative or full URL, and will include them in a crawl. Many websites, however, use JavaScript actions to fetch content. Crawlers may not be able to identify the URLs that are loaded by JavaScript, causing the content to be missed during an automated archiving process. Similarly, hyperlinks within compiled features, e.g. compiled 3D visualizations, can be difficult or impossible for a crawler to discover. When designing web content, consider the value of using simple HTML links so that crawlers can identify the URLs that make up a work. Note that, as with <link> tags, the target URLs of <a> tags will likely be crawled even if they do not display text on the page, and so they can be used to guide a crawler to relevant content. Conversely, a crawler cannot determine which of these tags link to content that is not vital to the work, so using these tags for other purposes, or leaving hidden link tags that are never used, can guide the crawler to things that may be out of scope for an archived copy of the publication, such as previous or unused iterations of a page.
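The difference can be illustrated with two ways of coding the same navigation control (the markup is illustrative):

```html
<!-- Crawler-friendly: the target URL is visible in the markup -->
<a href="/book-slug/resources?page=2">Next page</a>

<!-- Hard for crawlers: the URL is only constructed by JavaScript at click time -->
<div class="next-button" onclick="loadResources(2)">Next page</div>
```

Both controls behave identically for a human user, but only the first exposes the target URL to an automated crawl.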

This guideline may make changes for efficient crawling less critical:
43. Include a sitemap for all web-based publications

In an earlier version of the Manifold platform, the HTML for some of the buttons/links that are used to navigate the project were coded in a way that a web crawler would not be able to discover the target page of the link. Because web crawlers could not find these pages, portions of Manifold projects could be missing from a web archived copy. Web archive tools typically crawl a website by finding <a> tags in the HTML and then using the href= value of this tag to identify other pages that should be archived. Originally, the buttons that were used to page through the resources associated with a project in Manifold were all coded using <a href="#">. Clicking on this tag triggered some JavaScript that loaded the next set of resources into the page. This works fine for a human user, but for most web archive tools, this would lead them to retrieve the page https://{domain}/resources# instead of the various numbered resource pages that a user would see. In this case it resulted in only the first resource page being discovered for the archived version—the subsequent pages were not discovered, and so neither were the resources that were linked on those pages. After feedback from the embedding team, these links were re-coded in the format <a href="?page=2">, which leads to a URL that can be visited by a web crawler and also bookmarked by human users, e.g., https://{domain}/resources?page=2. In another part of the system, instead of using <a> tags, the code used <div> tags that were styled to look like buttons and had onclick() actions in JavaScript that loaded new content. A web crawler looks for <a> tags to identify links, but <div> tags have many uses and a crawler would not “know” to click on one to reach new content, so these pages would be missed by web archive tools. These were changed to use the format <a href="/projects/path-to-page">, which is understood by a web crawler. The content on those pages can now be automatically discovered and archived by a web crawler.

This can help facilitate a fully automated web harvest of content in situations where an export is not a feasible approach.

Bibliographic metadata is a vital component of a publication preservation package. As with other metadata, it is best to use a broadly adopted standard such as the Highwire Press citation_ tags used by Google Scholar, Dublin Core, or PRISM. Cover the core bibliographic information to make the publication findable, and be consistent. An expression of the material’s license, for example through <link rel="license" href=...>, is valuable since this can support an archive’s understanding of whether the material can be preserved and how it can be reused. Note that HTTP Link headers can also be used to convey some metadata and can be applied to the HTTP response of both HTML and non-HTML web resources. An approach to this is described at signposting.org.
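As a sketch, the <head> of a publication page might carry bibliographic metadata and a license link like this (the values are invented for the example):

```html
<head>
  <!-- Highwire Press tags read by Google Scholar -->
  <meta name="citation_title" content="An Example Article">
  <meta name="citation_author" content="Author, Example">
  <meta name="citation_publication_date" content="2023/01/01">
  <meta name="citation_doi" content="10.xxxx/example">
  <!-- Dublin Core -->
  <meta name="dc.title" content="An Example Article">
  <meta name="dc.creator" content="Author, Example">
  <!-- License expression usable by an archive -->
  <link rel="license" href="https://creativecommons.org/licenses/by/4.0/">
</head>
```

A harvesting tool can read these tags directly from the captured HTML, so accurate descriptive metadata travels with the archived page.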

These guidelines may also be relevant when generating bibliographic metadata:
21. Provide bibliographic metadata with exported publications
30. Bibliographic metadata in the context of EPUBs
40. The license for external resources can be expressed in HTML

The enhanced journal Technology, Mind & Behavior (TMB) is hosted on PubPub. Publishers can configure articles on PubPub to display the full article metadata in the <head> section of the web page HTML. If you inspect the HTML code of a PubPub-based TMB article, you will see that the <head> element at the top of the document includes bibliographic metadata in <meta> tags and implements several standards. One is citation_, which is used by Google Scholar to create search records, and another is dc, which stands for Dublin Core, a widely used descriptive metadata standard. Including metadata in these formats supports archiving performed by automatic harvesting: it allows harvest tools to extract accurate descriptive metadata from the web page of the article. Note that in the case of PubPub the license is not in the <head> section, which would be ideal, but is expressed at the bottom of the page via a link to the Creative Commons license. The anchor tag for that license has the format <a rel="license">, which indicates that it connects to the license for the page.

Data-driven websites can display different sets of resources from the server at the same URL. If different views of a page share the same URL, however, retrieving the page from a web archive could have unpredictable results. It is therefore helpful to ensure that, where reasonable, the URL reflects any filters or properties that change what is loaded into the browser from the server, via the path or query string (the part of the URL following the question mark). This allows the different states of a page to be bookmarked, and also makes it possible to use a sitemap to express the full range of resources that make up the website. While a sitemap can include API calls that might be used for dynamically generated views, sitemaps are easier to maintain if these views are also reflected in the browser’s address bar.
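One common pattern (sketched here with invented function names) is to update the address bar whenever JavaScript fetches a new view, so that each state of the page has its own bookmarkable URL:

```html
<script>
  // Load a page of the resource list via the API, render it, then record
  // the state in the address bar so it can be bookmarked, listed in a
  // sitemap, and replayed from a web archive.
  async function showResourcePage(page) {
    const response = await fetch(`/api/resources?page=${page}`);
    renderResources(await response.json()); // renderResources is hypothetical
    history.pushState({ page: page }, "", `?page=${page}`);
  }
</script>
```

Because history.pushState changes the visible URL without reloading the page, the dynamic experience is preserved while each view gains a stable address.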

This is another guideline about making URLs web archive friendly:
49. Parameters should not be added to the URL unnecessarily

Key URLs for a publication, such as a publication’s home page, should not change over time. If they must change, redirect the original URL to the new location. Apart from helping to decrease broken links from other websites, a well-planned URL structure can help with website preservation. Ensuring the publication’s URL does not change over time can make it easier to manage and connect different versions of the publication that are preserved, and to avoid duplication.
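If a key URL does have to change, a server-level permanent redirect preserves inbound links. For example, in an Apache configuration (the paths are illustrative):

```apache
# Permanently redirect the old publication home page to its new location
Redirect permanent /old-book-home https://example.org/books/book-slug/
```

A permanent (301) redirect tells both browsers and crawlers that the move is intentional and where the content now lives.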

These guidelines discuss identifiers, another way to support URL persistence:
27. Persistent identifiers can be used at the publication resource level
31. Persistent identifiers should be assigned to new versions of the work

Where there are multiple publications on the same domain or subdomain, and each one spans multiple pages, using a consistent and hierarchical naming convention in the URL path helps web harvesting tools identify the scope of each publication. For example, if the publication content is organized in directories such as example.org/book-slug/text and example.org/book-slug/resources, a crawler can be set to generate an archive of the resources within the “book-slug” directory.

Website crawling and playback of web archives use URLs as unique references—this includes the query parameters (after the “?” and, for some tools, after the “#”). Adding parameters to the URL that do not affect what data is loaded from the server, or simply reflect a default where the page is the same with or without the property, complicates the capture and playback of the web archive and bloats the size of the crawl since every URL is captured as if it is a new page even if the content is identical.

This guideline is also useful for creating web archive friendly URLs:
46. Assign each unique page state one, and only one, URL

Language tags, such as ?locale=en for English, may be appended to the end of the URL to reflect the display language of the publication. If the basic URL without the language tag displays the same default language, a web crawler will make a redundant copy by crawling both the basic URL and the URL with the tag. If the publication is available in multiple languages, either use the language tag only for non-default languages, or include a language tag for every language.

Many modern websites depend on JavaScript to load data from the server as the user interacts with the site, creating a dynamic experience. This can make it difficult for a web crawler to automatically create a functional copy of a web page, since it may not be able to predict all user behaviors that pull new content from the server. Some web developers design websites using a “progressive enhancement” approach, in which a baseline of functionality is supported for a variety of environments, including those with scripts disabled. Where this approach is used, the version of the site presented to users changes if they choose to disable, or cannot support, JavaScript in their environment: they instead see a scriptless version of the site that presents the core intellectual components of the page in a more static form. If this functionality exists or can be easily supported, it can serve as an alternative way to capture pages using web archiving in cases where the full dynamic version cannot be crawled automatically.
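A progressively enhanced page might include the core content statically and layer the dynamic behavior on top, along these lines (initZoomViewer is an invented function name):

```html
<!-- Baseline: the figure is in the HTML and visible even with scripts disabled -->
<figure id="fig-01">
  <img src="/media/fig-01.jpg" alt="Example figure">
  <figcaption>Example caption.</figcaption>
</figure>

<!-- Enhancement: if JavaScript runs, upgrade the static image to an
     interactive zoomable viewer -->
<script>
  initZoomViewer(document.querySelector("#fig-01 img"));
</script>
```

A crawler that cannot execute the script still captures the image and caption, i.e. the core intellectual components of the figure.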

This guideline describes an alternative way to manage JavaScript-rich features:
53. For dynamic web page features, favor designs that pre-load data

A form of this concept can be found in the Fulcrum resource pages. For each type of media resource, there are two ways to access the content: view it in an enhanced media viewer that is embedded in the page, or download the file via the download button. For those who can access the enhanced viewers on Fulcrum, they provide additional functionality such as speeding up and slowing down video, zooming into images, and viewing highlighted audio transcripts while listening to audio. If the enhanced viewers are not available to the user, the resource pages also include a download button so that the user can copy the file to their own machine to view it. This is helpful for web archiving, since a crawl may fail to capture all of the complexity of the enhanced media viewers, but it will almost certainly be able to copy the downloadable file if it is linked from the same page as the viewer.

Platforms with good search engine optimization implement paths to navigate every page via links. This is also useful for web archiving since both search-engine crawlers and web-archiving crawlers use similar mechanisms to discover all pages of content.

These guidelines also help a website crawler discover all content:
43. A sitemap can help website crawlers reach unlinked content
44. Use simple links to help a website crawler find content
46. Ensure each page state has its own unique URL

For publications where some content should not be preserved, consider tagging content in a consistent way that preservation export or harvesting processes can use to exclude items that should not be preserved. Platforms may want to facilitate this tagging.
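For example, a platform could adopt a markup convention along these lines, where the data-preservation attribute is an invented name for illustration, not an existing standard:

```html
<!-- The attribute flags this block for export/harvest processes to skip -->
<aside data-preservation="exclude">
  Promotional sidebar content that is not part of the work itself.
</aside>
```

An export process or crawl configuration can then filter on the agreed attribute consistently across all publications on the platform.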

These guidelines also concern the inclusion and exclusion of content in the preservation process:
10. Define and document core intellectual components that need to be preserved
20. Represent all core intellectual components of the work in the export package
40. Identify the rights for external web content
55. Consider whether it is ethical/appropriate to preserve social media
65. Ensure irrelevant or private administrative data is removed from data exports

To achieve a shared understanding between the publisher, authors, and preservation service about what can be preserved so that authors can make informed decisions about what enhancements to include in their publication, broadly describe preservation approaches for different types of content added to a platform. This documentation could indicate to authors, for example, that they should have appropriate rights to files uploaded into the system and that they will be shared with a preservation service. It might also define a platform’s approach to third-party content in iframes by stating that content in iframes may not be preserved or maintained. Alternatively, it could instruct authors that all content in iframes will be archived, so iframes should only be used if the content in them is owned by the author or they have rights that allow it to be harvested by a preservation service. Information about a platform-level approach can be incorporated into or connected to a Terms of Use document, or could be in the form of a publicly visible preservation policy.

See also:
6. Keep preservation partners informed of changes
10. Define and document the core intellectual components of a work